skip to main content


Search for: All records

Creators/Authors contains: "Carraher, Lee A."

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, it is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed upon ground truth. The well studied k-means problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignment such that it minimizes the within cluster sum of squares metric. Unfortunately k-means can become prohibitive for very large high dimensional data sets as iterative methods often rely on random access to, or multiple passes over, the data set — a requirement that is not often possible for large and potentially unbounded data sets. In this work we explore an randomized, approximate method for clustering called Tree-Walk Random Projection Clustering (TWRP) that is a fast, memory efficient method for finding cluster embedding in high dimensional spaces. TWRP combines random projection with a tree based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded search over the implied cluster bifurcation tree represented as approximate vector and count values. The TWRP algorithm is described and experimentally evaluated for scalability and accuracy in the presence of noise against several other well-known algorithms. 
    more » « less
  2. Clustering streaming data has gained importance in recent years due to an expanding opportunity to discover knowledge in widely available data streams. As streams are potentially evolving and unbounded sequence of data objects, clustering algorithms capable of performing fast and incremental processing of data points are necessary. This paper presents a method of clustering high-dimensional data streams using approximate methods called streamingRPHash. streamingRPHash combines random projections with locality-sensitivity hashing to construct a high-performance clustering method. streamingRPHash is amenable to distributed processing frameworks such as Map-Reduce, and also has the benefits of constrained overall complexity growth. This paper describes streamingRPHash algorithm and its various configurations. The clustering performance of streamingRPHash is compared to several alternatives. Experimental results show that streamingRPHash has comparable clustering accuracy and substantially lower runtime and memory usage. 
    more » « less
  3. The size and amount of data captured from numerous sources has created a situation where the large quantity of data challenges our ability to understand the meaning within the data. This has motivated studies for mechanized data analysis and in particular for the clustering, or partitioning, of data into related groups. In fact, the size of the data has grown to the point where it is now often necessary to stream the data through the system for online and high speed analysis. This paper explores the application of approximate methods for the stream clustering of high-dimensional data (feature sizes contains 100+ measures). In particular, the algorithm that has been developed, called streamingRPHash, combines Random Projection with Locality Sensitive Hashing and a count-min sketch to implement a high-performance method for the parallel and distributed clustering of streaming data in a MapReduce framework. streamingRPHash is able to perform clustering at a rate much faster than traditional clustering algorithms such as K-Means. streamingRPHash provides clustering results that are only slightly less accurate than K-Means, but in runtimes that are nearly half that required by K-Means. The performance advantage for streamingRPHash becomes even more significant as the dimensionality of the input data stream increases. Furthermore, the experimental results show that streamingRPHash has a near linear speedup relative to the number of CPU cores. This speedup efficiency is possible because the approximate methods used in streamingRPHash allow independent and largely unsynchronized analyses to be performed on each streamed data vectors. 
    more » « less
  4. This paper presents a solution to the approximate k-means clustering problem for very large distributed datasets. Distributed data models have gained popularity in recent years following the efforts of commercial, academic and government organizations, to make data more widely accessible. Due to the sheer volume of available data, in-memory single-core computation quickly becomes infeasible, requiring distributed multi-processing. Our solution achieves comparable clustering performance to other popular clustering algorithms, with improved overall complexity growth while being amenable to distributed processing frameworks such as Map-Reduce. Our solution also maintains certain guarantees regarding data privacy deanonimization. 
    more » « less